Project Overview

Motivation

Bikesharing has become more and more popular in major cities in the United States “Americans are falling in love with bikeshare”. The Bluebikes is a Metro Boston’s public bike share program, with over 1,800 bikes and more than 200 stations across Boston, Brookline, Cambridge, and Somerville. Bluebike operates under the customer and subscriber systems. In 2017, the total number of trips is 1,313,837, and the number of subscribers is 14,577. For the individual customer, the single trip pass costs 2.5 dollars for 30 mins ride, with 2.5 dollars for an additional 30 mins. For the subscriber, Bluebike requires a membership fee of 99 dollars per year and the subscriber can have unlimited rides, with each ride under 45 mins. For any additional 30 mins post 45 mins ride, the subscriber would need to pay $2.5.

This graph shows that the number of Bluebike subscribers and customers had increased dramatically from 2011 to 2013 and has maintained the high number of riders until 2017.

This line chart shows the total number of the Bluebike trips, which could count multiple rides of one user. The number of trips also increased in the first part of the years as the number of users did. On the contrary to the number of riders, the trip counts have still increased in these few years.

Objectives

For this project, we would like to understand

  • What are the general demographics of bike riders?
  • How does the frequency of bike riding change with time, day and month of the year?
  • Does weather affect bike riding?
  • Given Bluebike only collects information on subscribers, how can we predict the characteristics and conditions for overtime users?

Approach

Starting with web scrapping we collected datasets from Boston Hubway and weather sites, focusing on the year 2017. We merged these datasets into a single csv file for our subsequent analysis. We performed exploratory analysis of bluebike riders in Boston, with the hopes of trying to have a general idea of who a typical Blue bike subscriber was, and how their riding pattern changed with changing weather. We then drilled down on bluebike subscribers who ride overtime, meaning more than 45 mins. Since these people pay an extra $2.5 per 30mins, and thus make bluebike company extra revenue. This informed our subsequent analysis, using multiple logistic regression techniques and Decision tree analysis. We tried to predict the typical oversubscriber and what day of the week such a rider would be out and about.

Data Preparation

# Importing the all Bluebike data in 2017
dat_201701<-read.csv("201701-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201702<-read.csv("201702-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201703<-read.csv("201703-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201704<-read.csv("201704-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201705<-read.csv("201705-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201706<-read.csv("201706-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201707<-read.csv("201707-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201708<-read.csv("201708-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201709<-read.csv("201709-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201710<-read.csv("201710-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201711<-read.csv("201711-hubway-tripdata.csv",stringsAsFactors = FALSE)
dat_201712<-read.csv("201712-hubway-tripdata.csv",stringsAsFactors = FALSE)

# Combining the data
dat_2017 <- rbind(dat_201701, dat_201702, dat_201703, dat_201704, dat_201705,
                  dat_201706, dat_201707, dat_201708, dat_201709, dat_201710,
                  dat_201711, dat_201712)

Adding variables

# age, age_cat, duration_min, year, month, month_abb, day, hour, wday
bbike <- dat_2017 %>%
  mutate(birth.year = as.numeric(birth.year)) %>%
        mutate(age = 2017 - birth.year) %>%
  mutate(age_cat = case_when(
    .$age >= 10 & .$age < 20 ~ 1,
    .$age >= 20 & .$age < 30 ~ 2,
    .$age >= 30 & .$age < 40 ~ 3,
    .$age >= 40 & .$age < 50 ~ 4,
    .$age >= 50 & .$age < 60 ~ 5,
    .$age >= 60 & .$age < 70 ~ 6,
    .$age >= 70 & .$age < 80 ~ 7,
    .$age >= 80 ~ 8)) %>%
  mutate(duration_min = tripduration / 60) %>%
  mutate(year = year(starttime), 
         month = month(starttime),
         month_abb = month(starttime, label = TRUE, abbr = TRUE), 
         day = day(starttime),
         hour = hour(starttime), 
         wday = wday(starttime, label = TRUE, abbr = TRUE))
# dist_km: Trip distance (km)
setDT(bbike)[ , dist_km := distGeo(matrix(c(start.station.longitude, start.station.latitude), ncol = 2),matrix(c(end.station.longitude, end.station.latitude), ncol = 2))/1000]
bbike <- as.data.frame(bbike)
# overtime
bbike <- bbike %>%
  mutate(overtime = ifelse(duration_min > 45, 1, 0))
# user_start, user_end: number of users at the start/end station
bbike <- bbike %>%
  group_by(start.station.id) %>%
  mutate(user_start = n())

bbike <- bbike %>%
  group_by(end.station.id) %>%
  mutate(user_end = n())

Combining weather information

## temp_max, temp_min, rain, snownice(snow or ice)
weather <- read_excel("boston_weather.xls")
bbike <- bbike %>%
  group_by(year, month, day) %>%
  left_join(., weather, by = c("year", "month", "day")) 

Analysis

Number of trips by gender and age

bbike %>%
  mutate(gender = as.factor(gender)) %>%
  filter(gender != "0") %>%  #1 is male, 2 is female, 0 is unknown
  ggplot(aes(factor(age_cat))) +
  geom_bar(aes(fill = gender)) +
  scale_x_discrete("Age", 
                   labels = c('10s','20s','30s','40s','50s','60s','70s','80+')) +
  scale_fill_manual("Gender", values = c("#2CA3E1", "#FF8F1C"), labels = c("Male", "Female")) +
  scale_y_continuous("Number of Rides") +
  ggtitle("Total Number of Trips by Gender and Age") +
  theme_bw()

In this bar chart, people with gender-code “0” were removed from the analysis. Age was stratified into bands of 10 years in width and shows as categorical variable for comparison. Main Blue Bike users are 20s or 30s, and male users are dominant in each generation.

Blue bike Stations

# Create data frame
dat_station<- read.csv("Hubway_Stations_as_of_July_2017.csv")
dat_2017_station <- bbike %>%
filter(!birth.year=="\\N")%>%
                        filter(birth.year>1900 & birth.year<=2017)%>%
  group_by(start.station.id) %>% summarize(number = n(), start.station.latitude=first(start.station.latitude), start.station.longitude=first(start.station.longitude),start.station.name=first(start.station.name))%>% filter(start.station.latitude>0)
summary(dat_2017_station)
##  start.station.id     number      start.station.latitude
##  Min.   :  1.00   Min.   :    4   Min.   :42.30         
##  1st Qu.: 54.75   1st Qu.: 1794   1st Qu.:42.34         
##  Median :108.50   Median : 4750   Median :42.36         
##  Mean   :111.91   Mean   : 5625   Mean   :42.36         
##  3rd Qu.:173.25   3rd Qu.: 7614   3rd Qu.:42.37         
##  Max.   :232.00   Max.   :35702   Max.   :42.41         
##  start.station.longitude start.station.name
##  Min.   :-71.17          Length:196        
##  1st Qu.:-71.11          Class :character  
##  Median :-71.08          Mode  :character  
##  Mean   :-71.09                            
##  3rd Qu.:-71.06                            
##  Max.   :-71.01

Mapping about number of users at each station

# Distinguish stations by color based on the number of users
getColor <- function(df){
  sapply(df$number, function(number) {
  if(number %in% 1:4000) {
    "green"
  } else if(number %in% 4001:10000) {
    "orange"
  } else if(number >10000) {
    "red"
  } else {
    "blue"
  } })
}

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = getColor(dat_2017_station)
)

leaflet(dat_2017_station) %>% addTiles() %>%
  addAwesomeMarkers(~start.station.longitude
, ~start.station.latitude, icon=icons, label=~as.character(number), popup = ~start.station.name)

This map classified Blue Bike stations into three categories based on the number of trips in 2017: Red stations are most popular, orange stations are moderate and green stations are least popular. The most popular station was “MIT at Mass Ave / Amherst St”, at the number of trips 35,700 per year. The least popular station was “Four Corners – 157 Washington St”, at the number of trips just four per year.

Number of trips in each hour within a day through Sunday to Saturday

bbike %>% ggplot(aes(hour))+
  geom_bar(fill = "#006AC6", color = "#0050B5")+
  scale_x_continuous(breaks=(c(0,4,8,12,16,20)))+
  xlab("Time")+
  ylab("Number of Users") +
  facet_wrap(~ wday) +
  theme_fivethirtyeight()

These bar graphs show the number of trips in each hour within a day through Sunday to Saturday. We can see biphasic trend of 8 am and 5 pm on weekday, which are consistent with commute time. On the other hand, we could see a completely different trend on weekend.

bbike %>% 
  filter(gender != 0) %>%
  ggplot(aes(hour)) +
  geom_bar(aes(fill = factor(gender))) +
  scale_x_continuous(breaks=(c(0,4,8,12,16,20))) +
  scale_fill_manual("Gender", values = c("#0090DA", "#FF8F1C"), 
                      labels = c("Male", "Female")) +
  xlab("Time")+
  ylab("Number of users") +
  facet_wrap(~ month) +
  theme_economist_white()

The figure shows seasonal comparison of users throughout a year. As expected, we see small number of trips in winter.

Distribution of Trip Duration

bbike_member <- bbike %>%
  filter(usertype == "Subscriber")
summary(bbike_member$duration_min)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.02     6.10     9.78    13.31    15.48 61276.80

As you can see, the data seems to have wrong information. The very long tripduration might be attributable to lost or other errors. Therefore, we limit the range from 0 to 75 minutes for the duration in this study.

bbike_member <- bbike_member %>%
  filter(duration_min < 50)
bbike_member %>%
  mutate(group = ifelse(rain == 0, "no rain", "rain")) %>%
  ggplot(aes(duration_min, y = ..count.., fill = group)) +
  geom_density(alpha = 0.2) +
scale_fill_discrete("Rain", labels = c("No rain", "Rain")) +
  xlab("Trip Duration (min)") +
  ylab("Riders") +
  theme_bw()

Since the maximum trip duration 61276.80 minutes, we decided to cap our analysis at 75mins. As a bike-leasing organization, a very long trip duration might probably be due to a missing bike or some unknown reason.

Unsurprisingly, we see that many riders use the bikes when the weather is more conducive. The number of blue-bike users when it is not raining is more than double the number of people who use the bikes in the rain.

bbike_member %>%
  filter(gender != 0) %>%
  mutate(gender = as.factor(gender)) %>%
  ggplot(aes(duration_min, y = ..count.., fill = gender)) +
  geom_density(alpha = 0.2) +
  scale_fill_manual("Gender", values = c("#0090DA", "#FF8F1C"),
                    labels = c("Male", "Female")) +
  xlab("Trip Duration (min)") +
  ylab("Riders") +
  theme_bw()

This gender difference was however not very obvious after about 40mins. Again we can infer from this graphic that, gender played little role among over time riders.

Trip distance

# Distance
bbike %>% 
  filter(birth.year > 1900 & birth.year <= 2017)%>%
  group_by(age_cat) %>% 
  summarize(avg = mean(dist_km), se = sd(dist_km) / sqrt(n())) %>% 
  ggplot(aes(age_cat, avg))+ 
  geom_errorbar(aes(ymin = avg - 2*se, ymax = avg+2 * se), width = 0.2, alpha = 0.5)+
  geom_point(color = "#FF8F1C")+
  geom_line(color = "#1D428A")+
  scale_x_continuous(breaks=(c(1,2,3,4,5,6,7,8)), 
                     labels=c("10s","20s","30s","40s","50s","60s","70s","80+"))+
  xlab(expression(paste(Age, " (years)")))+
  ylab(expression(paste(Distance," (km)"))) +
  ggtitle("Bike rides and Gender") +
  theme_bw()

Here there’s an initial upward trend we see with age and distance covered in kilometers. As age increased from the teens to about age 50, the distance covered also increased. This positive linear relationship was at age 50, after which we began to see a decline. Interestingly, there was another upward trend after age 80.

Prediction

# Keep only the subscribers in the dataset 
bbikemember <- bbike %>%
  ungroup() %>%
  filter(usertype == "Subscriber" & duration_min < 75) %>%
  mutate(gender = as.factor(gender), day = as.factor(day),
         hour = as.factor(hour)) %>%
  mutate(rain_cat = ifelse(rain == 0, 0, 1)) %>%
  mutate(snownice_cat = ifelse(snownice == 0, 0, 1)) %>%
  mutate(overtime = ifelse(duration_min < 15, 0, 1)) %>%
  mutate(overtime = as.factor(overtime)) %>%
  mutate(satsun = ifelse(wday %in% c("Sat", "Sun"), 1, 0)) %>%
  mutate(satsun = as.factor(satsun)) %>%
  filter(gender != 0)

Decision tree

output_tree1 <- ctree(overtime ~ gender + satsun,
                     data = bbikemember)
plot(output_tree1)

set.seed(6)
s <- sample(1:nrow(bbikemember), 500000)

dt_train = bbikemember[s,]
dt_test = bbikemember[-s,]

dtm = ctree(overtime ~ gender + factor(rain_cat) + factor(snownice_cat), data = dt_train)

plot(dtm)

dt_test$predClass = predict(dtm, newdata = dt_test, type = "response") 
dt_test$predProb = sapply(predict(dtm, newdata = dt_test, type= "prob"),'[[',2) 

dt_test$predClass2 = 0
dt_test$predClass2[dt_test$predProb >= 0.3] = 1

table(dt_test$predClass2, dt_test$overtime)
##    
##          0      1
##   0 340631 110554
##   1  97432  45622

Based on the tree plot, we might roughly estimate that women tend to ride over 45 minutes during no rainy or snowy days. Both female and male are less likely to over-ride.

roc_pred <- prediction(dt_test$predProb, dt_test$overtime)
pf <- performance(roc_pred, "tpr", "fpr")
plot(pf, col = "#FF8F1C")
abline(0, 1, col="grey")

# getting area under the curve
performance(roc_pred,"auc")@y.values
## [[1]]
## [1] 0.5428584

However, after checking the performance of the tree model, we figured out that the model did not seem to predict well because the AUC is 0.5428584. To find more accurate model, we proceeded to the logistic regression by putting more predictors in our new model.

logistic regression

set.seed(1)

Train <- createDataPartition(bbikemember$overtime, p = 0.5, list = FALSE)
training <- bbikemember[ Train, ]
testing <- bbikemember[ -Train, ]

#overall accuracy 
p = 0.358  #290235/810623
y_hat <- sample(c("0","1"), length(testing), replace = TRUE, prob=c(p, 1-p)) %>% 
  factor(levels = levels(bbikemember$overtime))
mean(y_hat == testing$overtime)
## [1] 0.4657506
#logistic regression
glm.fit <- training %>% 
  glm(overtime ~ gender + age + month + satsun + rain_cat +
               snownice_cat + user_start + user_end, data=., family = "binomial")

summary(glm.fit)
## 
## Call:
## glm(formula = overtime ~ gender + age + month + satsun + rain_cat + 
##     snownice_cat + user_start + user_end, family = "binomial", 
##     data = .)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3354  -0.8276  -0.7025   1.3495   2.7237  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -7.370e-01  1.620e-02 -45.484  < 2e-16 ***
## gender2       3.173e-01  7.022e-03  45.180  < 2e-16 ***
## age           7.089e-03  2.714e-04  26.115  < 2e-16 ***
## month         5.586e-03  1.246e-03   4.482  7.4e-06 ***
## satsun1       2.488e-01  7.862e-03  31.645  < 2e-16 ***
## rain_cat     -8.604e-02  6.848e-03 -12.564  < 2e-16 ***
## snownice_cat -6.166e-01  4.143e-02 -14.884  < 2e-16 ***
## user_start   -2.917e-05  4.184e-07 -69.709  < 2e-16 ***
## user_end     -2.724e-05  3.893e-07 -69.985  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 630108  on 547119  degrees of freedom
## Residual deviance: 611999  on 547111  degrees of freedom
## AIC: 612017
## 
## Number of Fisher Scoring iterations: 4
exp(cbind(OR=coef(glm.fit), confint.default(glm.fit)))
##                     OR     2.5 %    97.5 %
## (Intercept)  0.4785536 0.4635945 0.4939954
## gender2      1.3733690 1.3545963 1.3924017
## age          1.0071142 1.0065785 1.0076502
## month        1.0056015 1.0031481 1.0080609
## satsun1      1.2824760 1.2628654 1.3023912
## rain_cat     0.9175533 0.9053197 0.9299521
## snownice_cat 0.5397628 0.4976664 0.5854201
## user_start   0.9999708 0.9999700 0.9999717
## user_end     0.9999728 0.9999720 0.9999735
#prediction
p_hat <- predict(glm.fit, newdata=testing, type="response")
y_hat <- ifelse(p_hat > 0.5, 1, 0) %>% factor()

#confusion matrix 
table(predicted = y_hat, actual = testing$overtime)
##          actual
## predicted      0      1
##         0 403056 143450
##         1    346    267
confusionMatrix(data = factor(y_hat), reference = factor(testing$overtime))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction      0      1
##          0 403056 143450
##          1    346    267
##                                          
##                Accuracy : 0.7372         
##                  95% CI : (0.736, 0.7383)
##     No Information Rate : 0.7373         
##     P-Value [Acc > NIR] : 0.5966         
##                                          
##                   Kappa : 0.0015         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.999142       
##             Specificity : 0.001858       
##          Pos Pred Value : 0.737514       
##          Neg Pred Value : 0.435563       
##              Prevalence : 0.737320       
##          Detection Rate : 0.736688       
##    Detection Prevalence : 0.998880       
##       Balanced Accuracy : 0.500500       
##                                          
##        'Positive' Class : 0              
## 

ROC

model<-training %>%
glm(overtime ~ gender + age + month + satsun + rain_cat +
             snownice_cat + user_start + user_end, data=., family = "binomial")

predpr <- predict(model, newdata=testing, type="response")
roccurve<- roc(testing$overtime~predpr)
roccurve
## 
## Call:
## roc.formula(formula = testing$overtime ~ predpr)
## 
## Data: predpr in 403402 controls (testing$overtime 0) < 143717 cases (testing$overtime 1).
## Area under the curve: 0.6119
predict <- predict(model, newdata=testing, type = "response")

ROCRpred <- prediction(predict, testing$overtime)
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, col = "#0050B5", text.adj = c(-0.2,1.7))
abline(0, 1, col = "grey")

In this logistic regression model, we put gender, age, weekends (versus weekdays), rain, snow, the popularity of starting stations, and the popularity of ending stations to predict overtime riders. We created training dataset for building a model and testing dataset for check the performance of the model. Because of the big sample size, we got statistically higher or lower odds of over-riding in all of the predictors. However, if we look at the odds ratio, we could realize something interesting. The odds of over-riding when it snowed is 0.54 (95% CI: 0.50 to 0.59) times the odds when it did not. The odds of over-riding when it rained is 0.92 (95% CI: 0.91 to 0.93) times the odds when it did not. If it rains or snows, people tend to return bike on time. On the other hand, the odds of over-riding on weekends is 1.28 (95% CI: 1.26 to 1.30) times the odds on weekdays. The odds of over-riding among females is 1.37 (95% CI: 1.35 to 1.39) times the odds among males. These data shows females tend to use the Bluebikes longer than 45 minutes especially during weekends. Finally we drew the ROC curve using the testing dataset. However, AUC is 0.61, which means this model is not so good model to predict over-ride.